ABSTRACT

Problems with performance assessment (PA) and
multiple-choice tests (MCTs) are outlined, with reference to the
literature on accountability. PA for individual teachers who should
integrate their assessments with their instruction; PA as a
supplement to more traditional examinations for licensure decisions;
and some limited, experimental tryouts of PA for other accountability
purposes are supported. The anti-MCT demagogues, and making PA the
latest fad, are not supported. Reasons for PA's popularity include:
old (but inaccurate) criticisms of MCTs in terms of bias, irrelevant
content, and measurement of only recognition; cognitive
psychologists' belief that many parameters that they want to study
require formats other than MCT questions; increased concern that MCTs
delimit the domains that should be assessed; wide publicity of the
Lake Wobegon effect of teaching too closely to MCTs; and claims that
teaching to MCT formats has deleterious instructional/learning
effects. PA problems vary depending on several dimensions, such as
secure versus non-secure assessments, matrix versus every student
assessment, and accountability versus instruction. PAs have
difficulty meeting the five "apple" criteria required of high-stakes
tests used for accountability purposes: administrative feasibility,
professional credibility, public acceptability, legal defensibility,
and economical affordability. It is concluded that MCTs measure some
things very well and efficiently; however, they do not measure
everything and their use can be overemphasized. PAs can measure
important objectives that cannot easily be measured by MCTs. A
52-item list of references is included. (RLC)
USING PERFORMANCE ASSESSMENT FOR ACCOUNTABILITY PURPOSES:

SOME PROBLEMS

William A. Mehrens

DEFINITION OF PERFORMANCE ASSESSMENT

As Fitzpatrick and Morrison pointed out twenty years ago, "there is no
absolute distinction between performance tests and other classes of tests"
(1971, p. 238). The distinction is the degree to which the criterion
situation is simulated. Typically what users of the term mean is that the
assessment will require the examinee to construct an original response. Some
people seem to call short answer questions or fill in the blank questions
performance assessments. However, it is more common in performance
assessment for the examiner to observe the process of the construction so
there is heavy reliance on observation and professional judgment in the
evaluation of the response. One of my favorite examples of a performance
assessment question has been graciously provided by Steve Koffler (1990).

MEDICINE: You have been provided with a razor blade, a piece of gauze, 
and a bottle of Scotch. Remove your appendix. Do not suture until your 
work has been inspected. You have fifteen minutes. 

FAD VERSUS ADVANCEMENT?

It is easy to be impressed with the enthusiasm, energy, and optimism
displayed by those doing research on performance assessment. However, it is
impossible to be impressed by the lack of objectivity or scientific rigor of
many of those advocating the current use of performance assessment.
Unfortunately, some have put on their advocacy hats before the data support
it.

A simple statement of my position is that I am in favor of performance
assessment for individual teachers who should integrate their assessments with
their instruction; I am in favor of performance assessment as a supplement to
more traditional examinations for licensure decisions;[1] and I am in favor of
some limited, experimental tryouts of performance assessment for other
accountability purposes. Many questions must be answered and problems must be
overcome before it should be used on a wide-scale basis. Further, I am
"anti" the anti-multiple-choice demagogues; and I am against turning
performance assessment into the latest fad.

One of the most important reasons for the continuing existence of the 
educational pendulum is that educators rarely wait for or demand hard 
evidence before adopting new practices on a wide scale (Slavin, 1989, p. 
753). 

WHY "NEW" PERFORMANCE ASSESSMENTS?

The first point that should be stressed is that performance assessment
really is not new. It was employed when the Gilead Guards challenged the
fugitives from Ephraim who tried to cross the Jordan River.

'Are you a member of the tribe of Ephraim?' they asked. If the man
replied that he was not, then they demanded, 'Say Shibboleth.' But if he
could not pronounce the 'sh' and said Sibboleth instead of Shibboleth he
was dragged away and killed. As a result 42 thousand people of Ephraim
died there at that time (Judges 12: 5-6, The Living Bible).

That obviously was a performance examination. I point it out because I
heard a speaker at a recent professional meeting say that "performance tests
have only been around a couple of years." That person obviously had some gaps
in his historical knowledge. Even a reading of the twenty year old
Fitzpatrick and Morrison chapter in the second edition of Educational
Measurement (1971) could have prevented such an inaccurate statement.
However, it is true that the popularity of talking about performance
assessment as the latest solution to our educational problems is a new
phenomenon.

[1] This is due to the high costs of false positives in licensure.

Like all "new" (or recycled) developments (fads) performance assessment
is backed by a very large number of people for a variety of reasons. Several
of the major reasons are as follows: (1) the old (but inaccurate) criticisms
of multiple-choice tests; (2) the belief of cognitive psychologists that many
of the things they are interested in assessing require formats other than
multiple-choice questions; (3) the increased concern that multiple-choice
tests delimit the domains we should be assessing; (4) the wide publicity of
the Lake Wobegon effect of teaching too closely to multiple-choice tests; and
finally, (5) claims that there are deleterious instructional/learning effects
of teaching to multiple-choice test formats. Certainly these five points are
related and overlapping, but they will be discussed separately.

TRADITIONAL (BUT INCORRECT) CRITICISMS OF MULTIPLE-CHOICE TESTS

There have been three main criticisms of objective paper/pencil tests:
They are biased, they measure irrelevant content, and the format demands only
the ability to recognize an answer--not to actually work problems.

Bias

This paper is not the place to refute the bias charge, but much has been
written about that issue and there is a great deal of evidence that most
objective tests have very little bias.

Irrelevant Content

The issue of content relevance is related in part to the issue of
whether the multiple-choice format can only be used for a limited number of
educational objectives/goals. But the issues are separable. To give you a
flavor of the criticism, consider the following quote:

We're spending hundreds of millions of dollars on tests that don't tell 
us anything about what kids know or know how to do (Shanker, cited in 
Putka, 1989). 

While the above quote was directed more at existing commercial
standardized tests than the objective format per se, the rhetoric stems at
least in part from incorrect beliefs about what multiple-choice tests can
measure. In addition to the concern about irrelevant content, there is the
concern about the narrowness of the content and its mismatch with the
curriculum (see Baker, Freeman, and Clayton, 1991).

There will never be universal agreement about the goals/objectives of
education. However, one must keep in mind how standardized multiple-choice
achievement test domains are determined. They are determined based upon very
thorough reviews of existing curriculum guides and textbooks. These, one would
assume, have been developed and/or adopted because they have some match to the
goals of the local schools. Most parents do want their children to learn the
content domains sampled by multiple-choice standardized achievement tests.

Consider the following quotes: 

Standardized multiple-choice tests have drawn increasing fire as too
simplistic, measuring the ability to recognize knowledge rather than the
ability to think and solve problems, an important skill in today's jobs
(Fiske, 1990, p. 1).

It's testing for the TV generation--superficial and passive. We don't
ask if students can synthesize information, solve problems or think
independently. We measure what they can recognize (Darling-Hammond as
quoted in Fiske, 1990, p. B8).

The notion that multiple-choice items cannot measure higher-order
thinking skills is unfortunate and incorrect. Forsyth has over the years
given any number of talks illustrating that multiple-choice achievement test
items can tap higher-order thinking skills (see, for example, Forsyth, 1990a).
If his examples have not convinced the doubtful, they simply are not
open-minded about it--or perhaps they don't think at a high enough level. Look
at the sample multiple-choice questions sent to students who register for the
SAT. You could not possibly answer those questions without engaging in some
problem solving and/or higher order thinking.

COGNITIVE PSYCHOLOGISTS' INFLUENCE

Over the past decade or so, many individuals have been hypothesizing on
"what cognitive psychology seems to offer to improve educational measurement"
(Snow and Lohman, 1989, p. 263). As Snow and Lohman state, "measurement
experts now need to know much more of cognitive psychology than they were
taught or are likely to learn without a precis" (p. 263). It is impossible to
argue with that point. However, it is possible to argue about just what the
measurement implications are from the current writings of cognitive
psychologists. As Snow and Lohman inform us, there are many controversies
among cognitive psychologists (p. 264). Cognitive psychology has its critics
(viewed as just by Snow and Lohman), and the field has been fragmented and
noncumulative (p. 770). Snow and Lohman suggest that the implications of
cognitive psychology are largely for measurement research (p. 312), and that
"cognitive psychology has no ready answers for the educational measurement
problems of yesterday, today, or tomorrow" (p. 320). Other researchers
generally seem to agree with this assessment (see Ohlsson, 1990; Lesgold et
al., 1990; and Linn, 1990). None of the researchers referenced are
suggesting wide adoption of their exploratory research.

Based on his research, Siegler warns us

that even seemingly well-documented cognitive psychological models may
be drastically incorrect, and that diagnoses of individuals based on
these models could only be equally incorrect. ...the time does not seem
ripe to advocate their use in the classrooms (1989, p. 15).

All this brings to mind the question of how to determine what is new and
what is true in current cognitive psychology. Bader, in discussing the "new"
reading objectives, quotes Roe, Stoodt, and Burns who stated that "activating
schemata involves recalling existing schemata that are related to a specific
subject and relating these schemata to the content being read. Students must
activate appropriate schemata." Bader asks us to contrast this statement with
the following one by Huey published more than 80 years ago in 1908: "When
reading, the learner forms meaning by reviewing past experiences that given
images and sounds evoke." (Both quotes taken from Bader, 1989, p. 627).

I suspect current theorists would argue that the schemata theories are
different from what Huey said in 1908, but I am drawn to a statement Bracey
made recently:

No current construct is trendier, squishier, and murkier than that of
'schema' ... (1991, p. 416).

In spite of the somewhat cautionary tone of the above paragraphs, I am
convinced that cognitive psychologists do have something to offer those of us
in measurement. However, I, like Snow and Lohman, think that it is primarily
in terms of helping measurement specialists to develop new, and hopefully
better, theories. We should not jump on any
"performance-assessment-for-accountability" bandwagon.

DELIMITED DOMAIN

Partly as a result of the cognitive psychologists' influence there has
been increased concern that multiple-choice tests cannot assess all the
important domains of educational goals/objectives. This fact has been known
almost forever, and the concern is not totally new. Across the decades
measurement specialists have agreed that objective tests cannot adequately
cover all objectives. For example, no one believes they are a good way to
measure perceptual motor skills. However, as measurement-driven instruction
has increased, the concern about the delimitation of the measured domains has
increased.

Cognitive psychologists distinguish between declarative and procedural
knowledge (or content knowledge and process knowledge). As Snow and Lohman
point out, all cognitive tasks require both types of knowledge, but different
tasks differ in the relative demands they place on the two. It is generally
accepted that some types of procedural knowledge are less amenable to
multiple-choice types of assessment. The increased (and in my view correct)
push for procedural knowledge goals has led to an increase in the attempts to
engage in performance assessment. However, this should not result in a
replacement of objective tests.

As Weinstein and Meyer (1991) make clear in their chapter on the
implications of cognitive psychology for testing, many different educational
tasks require simple recall--particularly in the lower grades and in
introductory courses. Further, experts differ from novices in their knowledge
base, and research suggests "that domain knowledge is a necessary but
insufficient condition for acquiring strategies and expertise" (1991, p. 42).
One example of the research on the importance of a knowledge base is the
effects of prior knowledge on reading, and as the quote earlier by Huey
suggests, that is not a new idea.[2]

Collis and Romberg, advocates of performance assessment in mathematics,
admit that multiple-choice items provide

an efficient and economical means of assessing knowledge of and ability
in routine calculations, procedures, and algorithms. All seem to agree
that these skills are still an important part of mathematics
education. (1991, p. 102, italics added).

In spite of my belief in the importance of procedural knowledge and the
importance of doing some assessing by other than multiple-choice testing, I
remain puzzled by some of the writings regarding this "new" performance
testing. Some suggest that multiple-choice tests are indirect and what we
need are more direct measures of achievement. But cognitive psychologists
focus on processes (such as metacognitions) which are not amenable to direct
measurement. They demand indirect measurement (Weinstein & Meyer, 1991, p.
49). Baker, Freeman and Clayton were concerned with content-curriculum
mismatch but found current textbooks did not "allow the development of deep
understanding" (1991, p. 138), so for their research, they used new
material--certainly creating more mismatch. Others, seemingly not too fond of
the concept of measurement-driven instruction, wish to use performance tests
to reform the curriculum, which seems a lot like being in favor of
measurement-driven instruction to me. Baker et al. were also concerned with
pressures to test in "a relatively limited number of subject matters" (1991,
p. 133), and Carlson suggests that there has been a narrowing of the
curriculum as a result of not using performance assessment (cited in Rothman,
1990). But performance assessment certainly is less efficient at covering
broad domains of subjects than are multiple-choice tests. As Finn correctly
pointed out, the limited number of items on performance tests may narrow the
curriculum even more (cited in Rothman, 1990).

[2] See Hirsch (1988) for a supportive view of the importance of
knowledge to read with comprehension or to be culturally
literate.

Thus, there seems to be confusion regarding the domain issue. Some
think the problem is that multiple-choice tests do not cover a broad enough
domain. But performance tests will assess narrower domains--perhaps in more
depth. Some are concerned with the curriculum-test mismatch and the efforts
of educators to change the curriculum to increase the match; these people
generally see measurement-driven instruction as a bad thing. Others are
interested in using new assessment procedures to reform the curriculum and
hope there is a teaching to the assessment. All of this confusion gets
compounded by those who refuse to separate the issues of content vs. form of
an exam (which are related, but not identical issues).[3]

LAKE WOBEGON EFFECTS

High stakes tests can lead to teachers teaching too closely to the test,
thus raising scores without raising the inferred achievement. Some advocates
of performance assessment suggest that it is appropriate to teach directly to
that type of assessment because the instructors will be teaching appropriate
material in ways they ought to be teaching it. Consider the following quotes.

teaching to these [California Assessment Program] tests is what we
want, because the tests are 100% connected with real-world on the job
performance (Honig, cited in Pipho, 1989, p. 263).

if schools spend three or four weeks a year teaching to a performance
based test, at least they'll be teaching things they ought to be
teaching in ways they ought to be teaching it (Shavelson, cited in
Rothman, 1989, pp. 12-13).

However, those who feel that performance assessment is the solution to
teaching to the test are sadly mistaken. Their reasoning misses the point
about inappropriate test preparation. They basically ignore the domain/sample
problem that is exacerbated when one delimits the sample as one must in a
performance assessment.

[3] Actually the evidence regarding whether multiple-choice tests and
other assessments cover the same domains is quite mixed. Some
research suggests the same domains/constructs are being measured--
other research suggests that there are some differences (Ackerman &
Smith, 1988; Bennet, et al., 1991; Birenbaum & Tatsuoka, 1987; Farr
et al., 1990; Martinez, 1990; Traub & Fisher, 1977; Traub & MacRury,
1990; Ward, 1982; Ward et al., 1980).

DELETERIOUS INSTRUCTION 

Tied to all the above issues is the belief that if one tests via a 
multiple-choice test, and if one instructs so that the students will do well 
on the multiple-choice test, the instruction must be deleterious; however, if 
one assesses via performance measures, the instruction will be beneficial. 

It is true that the format of the assessment will have some effect on
instructional practices, that this effect will be greater if the assessment is
for high stakes accountability decisions, that answering multiple-choice
questions is not a task that is done a lot outside of school, and that
excessive instruction tied too closely to an unrealistic form of assessment is
a poor instructional strategy. Nevertheless, it is not true that performance
assessment will necessarily lead to high quality instruction. The Honig and
Shavelson quotes above are just not true. The California Assessment Program's
(California State Department of Education, 1989) five performance items in
math are certainly not "100% connected with real-world on the job
performance." Further, teachers could spend time teaching correct answers to
these questions without "teaching things they ought to be teaching in ways
they ought to be teaching it."

Certainly many teachers would not say that the performance assessment of
teachers has resulted in increased learning about how to teach. Further, I
submit that if student performance measures become the criteria for teacher or
school accountability, teachers will complain about those measures also. It
is important to keep in mind Linn's admonition that we need to do more than
just assume that the alternatives to multiple-choice items will have no bad
side effects of their own (see Moses, 1990).

Again, I have perhaps sounded cautionary--that is the role of a
person trying to contain a fad. However, writing assessment has probably
increased the instruction of writing and that is a good thing.[4] I suspect
performance assessment of safety procedures in the science laboratories might
increase the efforts of teachers to teach safety procedures, and that would be
a good thing. But we must be somewhat prudent in our charges regarding the
ills of multiple-choice tests and our claims about the wonders of performance
assessment for instruction.

[4] Evidence appears mixed on this. Seventy-eight percent of California
junior high school teachers said state writing assessment increased
the number of writing assignments given to students (Moses, 1990).
However, 1988 NAEP data allow the authors to conclude that "the
recent interest in encouraging writing across the curriculum does
not appear to have been carried out in practice" (Applebee, et al.,
1990, p. 7).
PROBLEMS WITH PERFORMANCE ASSESSMENT FOR ACCOUNTABILITY

IMPORTANT DIMENSIONS

Like other forms of assessment, the particular problems that are likely
to be faced with performance assessment vary somewhat depending on a variety
of dimensions such as (1) secure vs. non-secure assessments, (2) matrix versus
every pupil assessment, and (3) accountability vs. instruction.

Secure vs. non-secure instruments

One extreme disadvantage of performance assessment is that, with only a
few questions, there is no way to keep the exact content of the exam secure.
Once performance assessments have been used, they cannot be reused to test
the same higher-order thinking process. One can memorize the answer to a
higher-order question just as well as one can memorize an answer to a
basic-skills question. Thus, performance assessments will have to be new each
year, adding to the developmental costs and making across-year comparisons of
growth very difficult.

Baker, Freeman, and Clayton took a different approach. They have
suggested that

only if the tasks and scoring criteria are made public ...can teachers
guide students to meet such standards, and then only if the same tasks
are used (1991, p. 137).

While I grant that this may be done without corrupting the inference for some
physical performance tasks (e.g. diving), performance assessment tasks that
have a metacognitive component do not allow for such release and reuse of the
tasks.

Matrix vs. every pupil assessment

Different cost issues arise with these two methods. Assessments that
would be cost prohibitive for every pupil testing may be reasonable in a
matrix sampling approach. However, this makes the assessments much less
useful to individual teachers. Further, some high stakes tasks such as those
used for licensure and high school graduation requirements demand every pupil
testing.

Accountability vs. instruction 

The title and thrust of this paper is on the use of performance
assessment in accountability programs. Yet most of the research and rhetoric
regarding the advantages of performance assessment has been in the realm of
individual pupil diagnosis. When one switches from local classroom assessment
for individual diagnostic purposes to mandated assessment for accountability
purposes, different issues arise. Most measurement experts I know believe
that if you use performance assessment for high-stakes accountability
purposes, the same kinds of problems as have occurred with multiple-choice
tests will exist.

High-stakes tests used for accountability purposes need to meet what
Baratz-Snowden (1990) has referred to as the five "apple" criteria:
Administratively feasible, Professionally credible, Publicly acceptable,
Legally defensible, and Economically affordable.[5] I maintain that
performance assessment is likely to have difficulty meeting all of those
standards. Currently it appears to meet the professionally credible and
publicly acceptable criteria but that is because it is in the fad stage.
More careful scrutiny may change that.

[5] Admittedly, her writing pertained to licensure tests, but I believe
the generalization of the criteria to accountability assessment is
reasonable.

ADMINISTRATIVELY FEASIBLE/ECONOMICALLY AFFORDABLE

Because resources are always limited, the costs of performance
assessment must be of great concern. ETS has reported that

one state with a strong commitment to educational assessment found that
redesigning its state program around performance tasks would increase by
tenfold the cost of the existing state assessment program (1990, p. 6).

Given my belief that most performance exercises are not reusable without
distorting the inference, there are some very real questions about the
developmental costs in performance assessment for accountability.

Even after performance assessments have been developed, the costs of
administering and scoring them are high. Frequently special equipment is
needed for administration and it is not feasible to have enough copies for
simultaneous administration. Consider, for example, the four components
being planned for an assessment of teachers' laboratory skills (Wheeler,
1990). There will be a pre-observation questionnaire, a pre-observation
conference, an observation and a post-observation conference. The observation
is to last 30 to 45 minutes. Observers in the pilot study were trained for
three days. All this will certainly be expensive.



PUBLICLY ACCEPTABLE 

So far the performance assessment advocates have done a good job with
public relations. But, as with multiple-choice tests, once they have been
used awhile for accountability purposes and the teachers complain (correctly)
about their lack of validity for accountability inferences, there may be a
reduction in public acceptability. Once the public understands that the costs
will be substantially higher, one might expect some loss of acceptance of the
process.



LEGALLY DEFENSIBLE 

Legally, performance assessment is considered a test (Nathan & Cascio,
1986, p. 1).

Whether or not that is how all courts would decide the issue, prudent
individuals developing performance assessments for high-stakes decisions would
be wise to act as if this were the case.[6] Experts for plaintiffs generally
psychometrically attack tests based on whether the Standards (AERA, APA, NCME,
1985) have been followed. One should expect them to do the same for
performance assessment. Whether performance assessments can meet the various
psychometric standards of reliability, validity, etc. is doubtful. But other
legal concerns also need to be considered. For example, if there is any
disparate impact on protected groups, how might one deal with the fact that
graders may be aware of the group status of the students? If there is debate
about the scoring process, will there be documentation of the performance so
rescoring can occur?

PROFESSIONALLY CREDIBLE

Professional credibility pertains at least to three overlapping groups:
teachers, those involved in teacher education, and psychometricians. Because
of effective P.R. and face validity, performance assessment probably has more
credibility than multiple-choice testing for the first two groups. It is
impossible to know if that will continue if performance assessment becomes
widely used for accountability. Certainly wide use would result in more
scrutiny than such assessments have currently been given, and the whole
movement could implode following such scrutiny. Psychometricians will
hopefully place or withhold their stamps of approval based on evidence
regarding the psychometric properties of the assessments. This may place them
at a different place on the credibility continuum from those individuals who
claim that psychometric properties such as reliability do not matter (who
cares about random error anyway if we are measuring the right thing).

[6] See Watson v. Fort Worth Bank and Trust, 1988, for a
discussion of this issue in employment testing.

Let us turn to a discussion of some specific psychometric issues: 
validity, reliability, scoring/scaling/equating, and bias. 

Generally, psychometricians believe it is important to validate new
approaches to testing before any wide implementation (see Nickerson, 1989).
Unfortunately, others say validity is a "red herring" (Carlson cited in
Rothman, 1990, p. 12).

Performance assessments have face validity--or what Popham (1990), a
veritable virtuoso of verbosity, says can be more pedantically described as
verisimilitude. Face validity helps in the acceptance of an assessment
procedure. Some level of face validity is essential for public credibility.
But it does not take the place of real validity and is simply not sufficient.
Yet, many of the advocates of performance assessment act as if it is.

In studying the validity of performance assessments, one should think
carefully about whether the right domains are being assessed, whether they are
well defined, whether they are well sampled, whether--even if well sampled--
one can infer to the domain, and what diagnostically one can infer if the
performance is not acceptably high.

Correct Domains?

A wish to assess the correct domains was a major reason for implementing
performance assessment, and I am, in a general sense, in favor of what
cognitive psychologists and reform educators are stressing. Nevertheless, the
appropriateness of performance domains is as subject to debate as is that of
domains assessed via paper/pencil tests. As mentioned earlier,
multiple-choice tests do not measure everything. But neither do performance
assessments. And some domains being proposed for performance assessment can
much more efficiently be measured by multiple-choice tests. In general,
performance assessment measures a narrower domain than multiple-choice
testing, but assesses it in more depth. Is this good? What narrow domains
need to be assessed in depth?

Well Defined Domains?

If one is satisfied that the right domains are being assessed, one
should still consider whether they are defined tightly enough. Critics of
standardized tests have suggested that the domains are not well-enough defined
in those tests. My feeling is that the domains of multiple-choice achievement
tests that have been used for accountability purposes have been more tightly
defined than many performance assessment domains.

Adequate Saiipllng? 

The major problems for valid performance assessment relate to the
limited sampling and the lack of generalizability from the limited sample to
any identifiable domain. One of the generally accepted advantages of
multiple-choice testing is that one can sample a domain much more thoroughly than by
performance assessments. Because performance assessment takes more time,
fewer tasks (questions) can be presented. Thus, the sampling of the domain is
less dense. For example, in California, there were only five mathematics items
on their performance assessment. One would be hard pressed to generalize to
any curricular domain from such a limited sample.
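
The sampling argument can be illustrated with a minimal sketch. It is not
taken from the assessments discussed here and rests on simplifying
assumptions: each task is scored pass/fail, tasks are drawn independently
from the domain, and the function name is hypothetical. Under those
assumptions the standard error of an observed proportion-correct score is
the binomial standard error, which shrinks only as the number of tasks grows.

```python
import math


def standard_error_of_proportion(p_true: float, n_tasks: int) -> float:
    """Binomial standard error of an observed proportion-correct score.

    Assumes (for illustration only) pass/fail tasks sampled independently
    from the domain, with p_true the examinee's true domain proportion.
    """
    if n_tasks <= 0:
        raise ValueError("n_tasks must be positive")
    if not 0.0 <= p_true <= 1.0:
        raise ValueError("p_true must lie between 0 and 1")
    return math.sqrt(p_true * (1.0 - p_true) / n_tasks)


if __name__ == "__main__":
    # Compare a five-task assessment with longer ones for the same examinee.
    for n_tasks in (5, 20, 50):
        se = standard_error_of_proportion(0.6, n_tasks)
        print(f"{n_tasks:>3} tasks: standard error = {se:.3f}")
```

Real performance tasks are rarely independent or dichotomous, so the sketch
only conveys the direction of the effect: fewer tasks, wider uncertainty
about standing in the domain.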

Generalizability?

Even if sampling is adequate, there is the question of whether one can
generalize from the sample to a larger domain. This is dependent upon the
intercorrelations between the portions of the domain in the sample and those
portions not in the sample. Certainly research has indicated that higher order
thinking skills and problem solving are specific to relatively narrow areas of
expertise and there appears to be little transfer from one subject matter to
another on these constructs.*

* See Norris (1989) for a discussion of both epistemological and
psychological generalizability of critical thinking.

But even within a subject matter area, generalizability is "iffy." As
Herman has pointed out

research in performance testing demonstrates how fragile is the
generalizability of performance (1991, p. 157).

She gives as one example the research that indicates writing skill does not
generalize across genres.

Or consider the generalizability of performance in a science laboratory
assessment. Some research has been conducted in California on the development
of a science laboratory assessment for new teachers. In their 1990 Final
Report, Wheeler and Page wisely state that they do not know if their
prototypic exercises will generalize

across different science laboratory situations--grades K-12; earth,
life, and physical sciences; various types of lab activities; different
groups of students; and different lab setting, including field trips.
...conclusions about the generalizability of the assessment should be
based on a large-scale field testing that includes many more types of
situations (Wheeler and Page, 1990, 60-61).

At this point in time we simply do not have enough data indicating the
degree to which we can generalize from most of the performance assessments
that are being conducted. Much of the evidence we do have would suggest that
generalizability is extremely limited.

Correct Inferences About Sample Performance?

Even if the domain is the correct one, it is well defined, the sample is 
adequate, and generalizability is possible, validity problems remain. One
has been alluded to earlier. If the assessment is not secure, students will
be taught how to do that particular task. This not only makes the inference
to the domain inappropriate, it means one may make an incorrect inference
about the sample performance. For anything other than a completely physical
skill (e.g., diving), one is typically making an inference about the cognitive
processes used. But one can memorize reasons as well as facts. Anytime one 
wishes to infer something like a metacognition, it is important that the 
assessment be secure. 

Finally, a threat to validity that deserves mention is the lack of 
ability to make a very precise inference from a poor score on a performance 
assessment. If, for example, one accepts Anderson's (1983) theory of skill 
development, there are three stages: the declarative stage, the knowledge 
compilation stage, and the procedural stage. At which stage is an individual
whose skill development is inadequate?

Reliability

There are several threats to reliability in performance assessment. One 
has to do with the small number of independent observations (the sampling
problem discussed above). A second has to do with a lack of internal
consistency (also discussed above). A third has to do with the subjectivity
of the scoring process, to be discussed below.

The reliability of performance assessment is apparently so low
in so many instances that, in a "preemptive counterattack," some advocates of
performance assessment have told us that reliability is not important. Some
have gone so far as to suggest that measurement theory is wrong when it says
reliability is a necessary prerequisite to validity. It is the critics who
are wrong. Reliability refers to random error in a measurement, and if random
error is too great, any perceived relevance of the assessment is illusory
because nothing is being measured (Fitzpatrick and Morrison, 1971, p. 268).
Thus, one cannot possibly make any valid inference from the data.
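
The dependence of validity on reliability can be made concrete with a small
sketch. It is not part of the original argument; it uses the standard
classical test theory result that an observed correlation between a measure
and a criterion cannot exceed the square root of the product of their
reliabilities. The function name and the example values are hypothetical.

```python
import math


def max_validity(reliability_x: float, reliability_criterion: float = 1.0) -> float:
    """Upper bound on the observed validity coefficient.

    Classical test theory: r_xy <= sqrt(r_xx * r_yy). A perfectly
    reliable criterion (r_yy = 1.0) is assumed by default.
    """
    for value in (reliability_x, reliability_criterion):
        if not 0.0 <= value <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return math.sqrt(reliability_x * reliability_criterion)


if __name__ == "__main__":
    # Illustrative reliabilities only; none are results from a study.
    for r_xx in (0.20, 0.50, 0.80):
        print(f"reliability {r_xx:.2f} -> validity ceiling {max_validity(r_xx):.3f}")
```

The bound is a ceiling, not an estimate: a reliable assessment can still be
invalid, but an unreliable one cannot be valid.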

The only performance assessment area that has reported much evidence on 
reliability has been writing assessment. There, the major evidence reported
is reader reliability. It generally runs in the low .80s. To obtain this
level of reliability is costly. It requires careful selection of and
extensive training of the raters, precise scoring guidelines, and periodic
rechecking of rater performance. Other types of reliability are less often
reported. For other areas of performance, I have heard rumors that
preliminary evidence shows internal consistency reliabilities to be as low as
.20. While there surely must be data I have not seen, I believe there are
serious problems with the reliability of many performance assessments.
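
To give a rough sense of what a reliability of .20 implies, the sketch below
applies the Spearman-Brown prophecy formula, a standard result that is not
cited in this paper. It assumes that any added tasks are parallel to the
existing ones, which performance tasks may well not be; the function names
are hypothetical.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when an assessment is lengthened by length_factor.

    Spearman-Brown prophecy formula; assumes the added tasks are parallel
    to the existing ones.
    """
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must lie strictly between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


def length_factor_needed(reliability: float, target: float) -> float:
    """Factor by which the assessment must be lengthened to reach target."""
    if not 0.0 < reliability < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    return target * (1.0 - reliability) / (reliability * (1.0 - target))


if __name__ == "__main__":
    current = 0.20  # the rumored internal consistency figure
    for k in (2, 4, 8, 16):
        print(f"{k:>2}x as many tasks -> projected reliability {spearman_brown(current, k):.2f}")
    print(f"lengthening needed for .80: {length_factor_needed(current, 0.80):.1f}x")
```

Under these assumptions, an assessment with a reliability of .20 would need
to be roughly sixteen times as long to reach .80, which bears directly on
the cost and feasibility concerns raised elsewhere in this paper.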

Scoring, scaling, equating, and aggregating data

Many issues arise concerning scoring, scaling, equating, and aggregating 
data. The major issues in these areas will be highlighted in the following
sections.

Scoring

It is obvious that there is subjectivity in assigning the scores to a
performance. This means that who does the scoring is very important for any
test used for accountability. Some telling data regarding scoring by anyone
having a vested interest in the results come from the judgments of teacher
performance by principals. State after state has obtained very negatively
skewed distributions when principals score teacher performance. When
assessing for accountability purposes, it is imperative to have performances
scored by those who do not have a vested interest in the outcome. Having
teachers score their own students' performances will not work. Further, if
the school building or school district is being held accountable for the
scores on performance assessments, the scorers must come from outside the 
district. 

The issue of "what" is to be scored is also of considerable importance.
Typically, "an examinee response is complex and multifaceted, comprising
multiple, interrelated parts" (Millman and Greene, 1989, p. 344). One can
use either componential or holistic scoring. As Millman and Greene pointed
out, in either case, to develop the scoring criteria requires a clear
understanding of what it means to be proficient in the relevant domain (which,
in turn, assumes there is a good definition of the domain). Holistic scores
are useless for diagnostic/prescriptive purposes, so most advocates of
performance assessment probably will opt for developing scoring profiles (see
Wolf et al., in press). The Standards require that the reliabilities of the
subscores be reported. Further, if the data are going to be used for
diagnostic purposes, one should report the reliability of the difference
scores. It is my guess that these will generally be quite low. The profiles
for students' performances will likely be so unreliable they are useless.
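
Why difference scores tend to be unreliable can be shown with a short
sketch. It is not drawn from the Standards or from any data reported here;
it uses the textbook formula for the reliability of a difference between two
scores, under the simplifying assumption that both scores have equal
variances. The function name and the example values are hypothetical.

```python
def difference_score_reliability(rel_a: float, rel_b: float, corr_ab: float) -> float:
    """Reliability of the difference between two subscores.

    Textbook formula assuming the two subscores have equal variances:
        r_dd = ((r_aa + r_bb) / 2 - r_ab) / (1 - r_ab)
    """
    if corr_ab >= 1.0:
        raise ValueError("corr_ab must be less than 1")
    return ((rel_a + rel_b) / 2.0 - corr_ab) / (1.0 - corr_ab)


if __name__ == "__main__":
    # Illustrative values only: two subscores, each with reliability .70.
    for corr in (0.0, 0.3, 0.6):
        r_dd = difference_score_reliability(0.70, 0.70, corr)
        print(f"subscore correlation {corr:.1f} -> difference reliability {r_dd:.2f}")
```

Two subscores that each have a reliability of .70 and correlate .60 with
each other would, under these assumptions, yield a difference score with a
reliability of about .25. The more the subscores of a profile correlate, the
less dependable the differences among them become.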

Scaling

Determining how to scale the data from performance assessments is 
another challenge. In his paper on the NAEP Proficiency Scales, Forsyth
(1990b) convincingly argues that those scales do not yield valid
criterion-referenced interpretation. Large-scale performance assessments will likely be
equally difficult to scale.

Equating 

Because performance assessments yield fewer independent pieces of data, 
and because specific assessments should not be reused, the equating problems 
seem formidable. I realize that some states have some of the best experts in
the nation working on this issue. I am not aware of what the proposed
solutions will be. In any case, for longitudinal comparisons and fairness in
accountability, the scores on different forms of performance assessments must 
be equated so that they represent the same level of achievement regardless of 
when the performance was assessed, which tasks were given, or which raters 
scored the performance. 

Aggregating

Decisions about the unit of reporting will be difficult to make. 
Certainly for those performance assessments that are based on group activities
the unit cannot be the individual.* However, other types of assessment may
lend themselves to individual reporting.

* See, for example, the prototype math exercises for Maryland
State Department of Education, 1990.

Ethnic group differences

As mentioned earlier, one of the agendas for moving to performance 
assessments is that some individuals believe paper-pencil tests are biased.
Given the commonly used definitions of bias, the evidence does not support
that position. However, some are hopeful that performance assessments will
show smaller ethnic group differences. The results are not yet all in with
respect to this hope, but evidence on writing assessments across the nation
does not show smaller differences between black and white performers than are
obtained from multiple-choice tests. Further, the data will be more
complicated to interpret due to the subjective scoring processes and the
potential opportunity for scorers to allow ethnicity to influence their
scores.

CONCLUSIONS/IMPLICATIONS 

As measurement specialists have known for decades, multiple-choice tests 
measure some things very well and very efficiently. Nevertheless, they do not 
measure everything, and their use can be overemphasized. Performance 
assessments have the potential to measure important objectives that cannot
easily be measured by multiple-choice tests. 

CONTINUE RESEARCHING BUT DO NOT OVERSELL PERFORMANCE ASSESSMENT

Some research has been conducted regarding performance assessment but
much more research is needed. Like Wolf et al. (in press), I would call for
"mindfulness" (p. 4) in the performance assessment research, and hope the
researchers would "be as tough-minded in designing new options as [they] are
in critiquing available testing" (p. 38). Evidence regarding psychometric
characteristics must be gathered. One cannot "pursue these new modes of
assessment ... on the mere conviction that they are better" (p. 41). Finally,
I agree with Wolf et al., and wish to emphasize that researchers should be
"standing on the shoulders rather than the faces of another generation" (p.
8).

While continuing the research, performance advocates should not be
overselling what performance assessment can do. Wiggins has suggested that

It's wrong to say [performance assessments] were oversold; they were
overbought (cited in Rothman, 1990).

I do not see it that way. I think they have been both oversold and
overbought, and the sellers have not been truthful about competitive
products.

While standing on the shoulders of another generation, performance 
assessment researchers should not be intentionally or unintentionally 
misinterpreting what that generation has accomplished and the still current 
values of paper-pencil assessments. 

CONTINUE USING MULTIPLE-CHOICE TESTS

Most large-scale assessments have added performance assessments to their
existing array of efficient paper-pencil tests, not replaced them. This is
good. There is no question but that the multiple-choice format is the format
of choice for many assessments, especially for measuring declarative
knowledge.

CLOSING THOUGHTS/QUOTES 

From at least one point of view, performance assessment is a good thing
for measurement specialists, and education in general. It has resulted in
more money and more resources being devoted to assessment. This has opened up
a whole new assessment industry. It should result in more research regarding
the effects of testing on teaching and learning. Nevertheless, I agree with
Haney and Madaus who suggest that

the search for alternatives [to multiple-choice tests] is somewhat
shortsighted (1989, p. 683).

We also need to keep in mind a statement Lennon made more than a decade ago.

To encourage the innocent to root around in the rubble of discredited
modes of study of human behavior, in search of some overlooked
assessment "jewels," is to dispatch a new band of Argonauts in quest of
a non-existent Golden Fleece (1981, pp. 3-4).

Finally, we should heed the wisdom of Boring: 

The seats on the train of progress all face backwards; you can see the 
past but only guess about the future (1963, p. 5).